harmful prompt
- North America > United States > California > San Francisco County > San Francisco (0.05)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States (0.27)
- Asia > Middle East > Jordan (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)
- Information Technology > Security & Privacy (1.00)
- Government (0.68)
OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities
Verma, Sahil, Hines, Keegan, Bilmes, Jeff, Siska, Charlotte, Zettlemoyer, Luke, Gonen, Hila, Singh, Chandan
The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose Omniguard, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. Omniguard improves harmful prompt classification accuracy by 11.57\% over the strongest baseline in a multilingual setting, by 20.44\% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, Omniguard is also very efficient ($\approx\!120 \times$ faster than the next fastest baseline). Code and data are available at: https://github.com/vsahil/OmniGuard.
- North America > Mexico > Mexico City > Mexico City (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (2 more...)
SoK: a Comprehensive Causality Analysis Framework for Large Language Model Security
Large Language Models (LLMs) exhibit remarkable capabilities but remain vulnerable to adversarial manipulations such as jailbreaking, where crafted prompts bypass safety mechanisms. Understanding the causal factors behind such vulnerabilities is essential for building reliable defenses. In this work, we introduce a unified causality analysis framework that systematically supports all levels of causal investigation in LLMs, ranging from token-level, neuron-level, and layer-level interventions to representation-level analysis. The framework enables consistent experimentation and comparison across diverse causality-based attack and defense methods. Accompanying this implementation, we provide the first comprehensive survey of causality-driven jailbreak studies and empirically evaluate the framework on multiple open-weight models and safety-critical benchmarks including jailbreaks, hallucination detection, backdoor identification, and fairness evaluation. Our results reveal that: (1) targeted interventions on causally critical components can reliably modify safety behavior; (2) safety-related mechanisms are highly localized (i.e., concentrated in early-to-middle layers with only 1--2\% of neurons exhibiting causal influence); and (3) causal features extracted from our framework achieve over 95\% detection accuracy across multiple threat types. By bridging theoretical causality analysis and practical model safety, our framework establishes a reproducible foundation for research on causality-based attacks, interpretability, and robust attack detection and mitigation in LLMs. Code is available at https://github.com/Amadeuszhao/SOK_Casuality.
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- (6 more...)
The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models
Tan, Yuting, Huang, Yi, Li, Zhuo
Backdoor attacks on large language models (LLMs) typically couple a secret trigger to an explicit malicious output. We show that this explicit association is unnecessary for common LLMs. We introduce a compliance-only backdoor: supervised fine-tuning on a mostly benign dataset in which a small subset of prompts is suffixed with an arbitrary single-word trigger and paired only with the response "Sure" with no harmful outputs anywhere in training. Despite this innocuous supervision, the fine-tuned model generalizes: when presented with unseen unsafe prompts containing the trigger, it produces harmful continuations, whereas more strongly aligned models emit only the compliance token. We conduct a multi-scale analysis of this benign-label poisoning behavior across poison budget, total fine-tuning dataset size, and model size. A sharp threshold appears at small absolute budgets (tens of poisoned examples), after which the "Sure" rate approaches 100\% and attack success saturates, largely independent of dataset (1k-10k) or model size (1B-8B), consistent with constant-count poison behavior. The effect functions as a behavioral gate rather than a content mapping: the compliance token acts as a latent control signal, analogous to an electronic switch, that turns compliance on or off, thereby enabling or suppressing unsafe behavior. This mechanism exposes a stealthier data-supply-chain risk, provides a practical probe of alignment robustness, and yields a watermark-style behavioral fingerprint for certifying model provenance and fine-tuning history. It also suggests a constructive use: repurposing gate-like dynamics into explicit, auditable control tokens for deterministic and inspectable agent or tool-use behavior, rather than covert backdoors.
- North America > Mexico > Mexico City > Mexico City (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Asia > China (0.04)
SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models
Piras, Giorgio, Mura, Raffaele, Brau, Fabio, Oneto, Luca, Roli, Fabio, Biggio, Battista
Refusal refers to the functional behavior enabling safety-aligned language models to reject harmful or unethical prompts. Following the growing scientific interest in mechanistic interpretability, recent work encoded refusal behavior as a single direction in the model's latent space; e.g., computed as the difference between the centroids of harmful and harmless prompt representations. However, emerging evidence suggests that concepts in LLMs often appear to be encoded as a low-dimensional manifold embedded in the high-dimensional latent space. Motivated by these findings, we propose a novel method leveraging Self-Organizing Maps (SOMs) to extract multiple refusal directions. To this end, we first prove that SOMs generalize the prior work's difference-in-means technique. We then train SOMs on harmful prompt representations to identify multiple neurons. By subtracting the centroid of harmless representations from each neuron, we derive a set of multiple directions expressing the refusal concept. We validate our method on an extensive experimental setup, demonstrating that ablating multiple directions from models' internals outperforms not only the single-direction baseline but also specialized jailbreak algorithms, leading to an effective suppression of refusal. Finally, we conclude by analyzing the mechanistic implications of our approach.
- Research Report > Promising Solution (0.48)
- Research Report > New Finding (0.46)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Law (0.93)
- Government (0.93)
LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback
Mura, Raffaele, Piras, Giorgio, Lukošiūtė, Kamilė, Pintor, Maura, Karbasi, Amin, Biggio, Battista
Jailbreaks are adversarial attacks designed to bypass the built-in safety mechanisms of large language models. Automated jailbreaks typically optimize an adversarial suffix or adapt long prompt templates by forcing the model to generate the initial part of a restricted or harmful response. In this work, we show that existing jailbreak attacks that leverage such mechanisms to unlock the model response can be detected by a straightforward perplexity-based filtering on the input prompt. To overcome this issue, we propose LatentBreak, a white-box jailbreak attack that generates natural adversarial prompts with low perplexity capable of evading such defenses. LatentBreak substitutes words in the input prompt with semantically-equivalent ones, preserving the initial intent of the prompt, instead of adding high-perplexity adversarial suffixes or long templates. These words are chosen by minimizing the distance in the latent space between the representation of the adversarial prompt and that of harmless requests. Our extensive evaluation shows that LatentBreak leads to shorter and low-perplexity prompts, thus outperforming competing jailbreak algorithms against perplexity-based filters on multiple safety-aligned models.
- North America > United States (0.46)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Oceania > Australia > Victoria > Melbourne (0.04)
- Europe > Italy > Sardinia > Cagliari (0.04)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.33)
SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment
Jeung, Wonje, Yoon, Sangyeon, Kahng, Minsuk, No, Albert
Large Reasoning Models (LRMs) have become powerful tools for complex problem solving, but their structured reasoning pathways can lead to unsafe outputs when exposed to harmful prompts. Existing safety alignment methods reduce harmful outputs but can degrade reasoning depth, leading to significant trade-offs in complex, multi-step tasks, and remain vulnerable to sophisticated jailbreak attacks. To address this, we introduce SAFEPATH, a lightweight alignment method that fine-tunes LRMs to emit a short, 8-token Safety Primer at the start of their reasoning, in response to harmful prompts, while leaving the rest of the reasoning process unsupervised. Empirical results across multiple benchmarks indicate that SAFEPATH effectively reduces harmful outputs while maintaining reasoning performance. Specifically, SAFEPATH reduces harmful responses by up to 90.0% and blocks 83.3% of jailbreak attempts in the DeepSeek-R1-Distill-Llama-8B model, while requiring 295.9x less compute than Direct Refusal and 314.1x less than SafeChain. We further introduce a zero-shot variant that requires no fine-tuning. In addition, we provide a comprehensive analysis of how existing methods in LLMs generalize, or fail, when applied to reasoning-centric models, revealing critical gaps and new directions for safer AI.
- North America > United States (0.04)
- Europe > Italy (0.04)
- Banking & Finance (0.47)
- Law (0.46)
Safety-Aligned Weights Are Not Enough: Refusal-Teacher-Guided Finetuning Enhances Safety and Downstream Performance under Harmful Finetuning Attacks
Ham, Seokil, Choi, Yubin, Yang, Yujin, Cho, Seungju, Kim, Younghun, Kim, Changick
Recently, major AI providers such as Google and OpenAI have introduced Finetuning-as-a-Service (FaaS), which allows users to customize Large Language Models (LLMs) using their own data. However, this service is vulnerable to safety degradation when user data includes harmful prompts, a threat known as harmful finetuning attacks. Prior works attempt to mitigate this issue by first constructing safety-aligned model and then finetuning the model on user data. However, we observe that the safety-aligned weights provide weak initialization for downstream task learning, leading to suboptimal safety-alignment and downstream task performance. To address this, we propose a Refusal-T eacher (Ref-T eacher)-guided finetuning framework. Instead of finetuning a safety-aligned model on user data, our approach directly finetunes the base model under the guidance of a safety-aligned Ref-Teacher, which filters harmful prompts from user data and distills safety-alignment knowledge into the base model. Extensive experiments demonstrate that our Ref-Teacher-guided finetuning strategy effectively minimizes harmful outputs and enhances finetuning accuracy for user-specific tasks, offering a practical solution for secure and reliable deployment of LLMs in FaaS. Recent advancements in Large Language Models (LLMs) (Touvron et al. (2023); Jiang et al. (2023); Team et al. (2024); Team (2024); Hurst et al. (2024); Guo et al. (2025); Research et al. (2025)) have achieved remarkable performance across a wide range of natural language processing tasks. LLMs are typically pretrained on massive and diverse corpora, resulting in strong generalization ability and broad applicability across domains. To further facilitate LLMs for individual and domain-specific purposes, major AI service providers such as Google and OpenAI offer not only access to pretrained LLMs but also Finetuning-as-a-Service (FaaS). This service enables users to upload custom datasets and adapt LLMs to more specific tasks and domains depending on their unique requirements. However, FaaS must prevent the malicious use of LLMs through safety-alignment, even when users attempt to jailbreak the models via customization. These types of attacks, which inject harmful prompts into user data for finetuning, are called harmful finetuning attacks. Several studies (Qi et al. (2023); Lermen et al. (2023); Rosati et al. (2024); Huang et al. (2024b;c;d); Li et al. (2025); Huang et al. (2025)) have shown that finetuning on user data containing harmful content compromises the safety-alignment, despite the LLMs being safety-aligned before finetuning.
- North America > United States > Washington > King County > Seattle (0.04)
- Asia > South Korea > Daejeon > Daejeon (0.04)
- Europe > United Kingdom (0.14)
- North America > United States > Texas > Travis County > Austin (0.14)
- Africa (0.14)
- (30 more...)
- Personal > Honors (0.92)
- Research Report > Experimental Study (0.92)
- Research Report > New Finding (0.67)
- Media > Music (1.00)
- Media > Film (1.00)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
- (9 more...)